202602130920 - swebench-verified-vs-pro

Main Topic

Question: What is the difference between SWE-bench Verified and SWE-bench Pro, and how should I read model comparison charts that show both?

SWE-bench is a benchmark where an agent is given a real repository plus an issue description, and must produce a patch that makes newly added fail-to-pass tests pass while keeping existing pass-to-pass tests passing. The score is therefore a measure of end-to-end software issue resolution under a fixed evaluation harness.
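
As a concrete illustration of this scoring rule, here is a minimal sketch of the per-task resolution check (the function and field names are illustrative, not the actual SWE-bench harness API):

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Decide whether a single task instance counts as resolved.

    fail_to_pass: test name -> passed after applying the model's patch
                  (these tests failed before the patch, by construction).
    pass_to_pass: test name -> passed after applying the model's patch
                  (these tests already passed before the patch).
    """
    # Every newly added fail-to-pass test must now pass...
    all_fixed = all(fail_to_pass.values())
    # ...and no previously passing test may regress.
    no_regressions = all(pass_to_pass.values())
    return all_fixed and no_regressions


# One fixed test plus one broken existing test means the task is not resolved.
print(is_resolved({"test_new_feature": True}, {"test_old_behavior": False}))  # False
```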

SWE-bench Verified and SWE-bench Pro are two different attempts to make this measurement more reliable, but they target different failure modes.

SWE-bench Verified is a human-validated subset of SWE-bench (500 tasks) designed to reduce false negatives caused by benchmark noise. In the original SWE-bench test set, some tasks are effectively underspecified, have overly specific tests, or suffer from environment brittleness, which can cause correct or reasonable fixes to be marked wrong. Verified was created by having professional software developers screen tasks and keep only those that are well scoped and solvable, alongside improvements to evaluation reliability (a containerized harness). In practical terms, Verified tends to measure how well an agent can solve medium-sized, realistic Python OSS issues when neither the task specification nor the evaluation is working against the solver.

SWE-bench Pro is designed to be a harder, more contamination-resistant benchmark that better reflects long-horizon professional engineering tasks across more diverse, industrially relevant codebases. Scale’s description emphasizes (1) stronger contamination resistance by construction (e.g., copyleft-licensed OSS subsets and private codebases), (2) greater task diversity beyond a small set of common utility libraries, (3) less oversimplification (issues are augmented by humans to be solvable without being trivial), and (4) reproducible environments built and validated by professional engineers. It also tends to require larger patches across more files, which increases the need for planning, codebase navigation, and sustained tool use.

How to interpret comparison charts that show both:

  1. Treat the two bars as measuring different regimes.
    Verified mostly measures whether a model can fix well-posed, single-issue Python OSS bugs; Pro additionally stresses long-horizon, multi-file agentic work in less familiar codebases.

  2. Expect a large performance drop from Verified to Pro.
    If a model (or agent scaffold) drops from around 70 percent on Verified to around 20 percent on Pro, that does not necessarily mean it got worse at coding; it often means Pro is testing additional capabilities that Verified only weakly stresses: multi-file reasoning, sustained execution, better context gathering, and robustness to more realistic repository structure and tooling.

  3. Use the gap as a diagnostic.
    A wide Verified-to-Pro gap points to a bottleneck in agentic competence (planning, codebase navigation, iteration) rather than in raw code generation.

  4. Read the metric definition carefully.
    Both benchmarks typically report a resolve rate (pass at 1), where a task counts as solved only if the patch makes the fail-to-pass tests pass and does not break the pass-to-pass tests (see the sketch after this list). Charts can still be misleading if different leaderboards use different scaffolds, tool budgets, or environment constraints. When comparing models, prefer results produced under a single, consistent scaffold.
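
To make the aggregation concrete, here is a minimal sketch of how the per-benchmark numbers and the gap are computed, assuming per-task resolution flags like those in the earlier sketch (the figures below are illustrative, not taken from any leaderboard):

```python
def resolve_rate(resolved_flags: list[bool]) -> float:
    """Resolve rate (pass at 1): fraction of task instances resolved on the first attempt."""
    return sum(resolved_flags) / len(resolved_flags)


# Hypothetical per-task outcomes for one model under one fixed scaffold.
verified_flags = [True] * 70 + [False] * 30   # roughly 70 percent on Verified
pro_flags = [True] * 20 + [False] * 80        # roughly 20 percent on Pro

verified = resolve_rate(verified_flags)
pro = resolve_rate(pro_flags)

# The gap is the diagnostic: a large drop points at agentic bottlenecks
# (planning, navigation, sustained execution) rather than raw code generation.
print(f"Verified: {verified:.0%}  Pro: {pro:.0%}  Gap: {verified - pro:.0%}")
```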

Practical takeaway: use SWE-bench Verified as a baseline for evaluating whether a model can solve well-posed OSS bugfix tasks, and use SWE-bench Pro to evaluate whether it can operate as a robust software engineering agent in harder, less gameable conditions. A chart that shows both is most useful for understanding where the system’s bottlenecks shift from code generation to end-to-end agent competence.

🌲 Branching Questions

What exactly does “Verified” verify?

Verified is primarily verifying benchmark quality: the issue description is sufficiently specified, the tests reflect the intended fix rather than hidden requirements, and the environment/harness is reliable enough that correct solutions are not rejected for incidental reasons. It is a quality-controlled slice of SWE-bench meant to reduce systematic underestimation caused by noisy tasks.

What makes Pro harder beyond “more tasks”?

Pro is designed to introduce harder and more realistic conditions: broader codebase diversity, longer-horizon tasks, larger multi-file changes, stronger contamination resistance, and human-built reproducible environments with human checkpoints for requirements and test relevance. These factors increase the need for exploration, planning, and iterative debugging rather than single-shot patch writing.

How should I use these benchmarks when choosing a model for my team?

If your workflow resembles quick bug fixes in familiar OSS-style repos, Verified scores correlate more directly with what you care about. If you are building an agent that must navigate unfamiliar repos, apply multi-file refactors, and iterate over failures, Pro is a better stress test. In both cases, consider running a small internal evaluation suite on your own codebase to complement public benchmarks.
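
For that internal evaluation suggestion, here is a minimal sketch of such a check, assuming you already have a candidate patch file and a test command for your own repository (the paths, the git-apply-based flow, and the cleanup step are assumptions, not a prescribed harness):

```python
import subprocess


def evaluate_patch(repo_dir: str, patch_path: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch to a clean checkout and rerun the test suite.

    Returns True if the project's tests pass after the patch is applied.
    """
    # Apply the candidate patch to the working tree.
    subprocess.run(["git", "apply", patch_path], cwd=repo_dir, check=True)
    # Run the project's own test command, e.g. ["pytest", "-q"].
    result = subprocess.run(test_cmd, cwd=repo_dir)
    # Roll the working tree back so the next candidate starts from a clean state
    # (files newly added by the patch would also need `git clean -fd`).
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir, check=True)
    return result.returncode == 0


# Example usage (hypothetical paths and command):
# ok = evaluate_patch("/tmp/my-repo", "/tmp/candidate.diff", ["pytest", "-q"])
```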

How should I read a bar chart comparing models on both benchmarks?

Read it as two dimensions: the Verified bar shows how well the model fixes well-posed OSS issues when the benchmark is not working against it, and the Pro bar shows how well it holds up as an end-to-end engineering agent on longer-horizon, multi-file tasks in more diverse codebases. A model that is strong on Verified but weak on Pro is most likely limited by agentic competence rather than by code generation.

What are common pitfalls when people cite SWE-bench scores?

The recurring ones, all touched on above: comparing numbers produced under different scaffolds, tool budgets, or environment constraints; conflating the variants (original SWE-bench, Verified, Pro) as if they measured the same thing; and ignoring contamination concerns when the model may have seen the underlying repositories or fixes during training.

References